Improve testing of `{Scheduler|Worker}MetricCollector` by hendrikmakait · Pull Request #6945 · dask/distributed

hendrikmakait · 2022-08-24T13:38:12Z

Closes #6943.

Tests added / passed
Passes pre-commit run --all-files

github-actions · 2022-08-24T15:17:21Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      15 files ±0       15 suites ±0 6h 53m 10s ⏱️ + 10m 13s
  3 054 tests +1   2 968 ✔️ - 1   84 💤 ±0 2 ❌ +2
22 592 runs +7 21 610 ✔️ +7 980 💤 - 2 2 ❌ +2

For more details on these failures, see this check.

Results for commit 51f9781. ± Comparison against base commit 817ead3.

♻️ This comment has been updated with latest results.

hendrikmakait · 2022-08-25T08:13:25Z

Failing tests:

Flaky distributed/tests/test_client.py::test_reconnect #6896
test_worker_start_exception
flaky test_local_directory #6934
flaky test_local_directory_make_new_directory #6935

New ones that look similar to other mysterious test timeouts we've had:

test_nanny_death_timeout
test_num_fds

gjoseph92 · 2022-08-29T20:31:24Z

-        PrometheusHandler._initialized = True
+        PrometheusHandler._collector = WorkerMetricCollector(self.server)
+        # Register collector
+        prometheus_client.REGISTRY.register(PrometheusHandler._collector)


If there are multiple concurrent Worker instances in the same process (very common with async workers in all gen_cluster tests), this means PrometheusHandler._collector will only be set to the WorkerMetricCollector of the last worker to start. Is that okay? Does the prometheus_client.REGISTRY.register(PrometheusHandler._collector) mean that all the collectors will be registered somewhere anywhere?

I don't know what either _collector or prometheus_client.REGISTRY are for, I'm just seeing some global variables getting overwritten in a place where I know there might not just be multiple workers sequentially, but also multiple workers in parallel.

The way this had been implemented before was that we'd simply set up a PrometheusHandler for the first worker spinning up in the process, set _initialized to True and call it a day. This led to test_prometheus_collect_task_states because that PrometheusHandler did not belong to the newly spun-up worker when running multiple tests. This new pattern surely isn't perfect as we only ever have a single instance of a PrometheusHandler per process talking to the last worker we spun up. However, IMO this is better than what we had before and it would require some more serious thinking to understand the Prometheus client to handle parallel workers gracefully. I'm happy to file a follow-up issue for that.

Regarding the singleton implementation: I've stolen that from https://github.com/hendrikmakait/distributed/blob/13e315c11c7277ba0cadd9f5c2a16364cdeaf14b/distributed/http/scheduler/prometheus/core.py#L79

This was pretty painful when we started testing this stuff. I'm OK with either solution and don't think we need anything more sophisticated for now.

As was already pointed out, this is merely an artifact for testing and no real world application would ever run multiple workers in the same process AND use prometheus. At the very least, this is something we then choose to not properly support

TLDR As long as the testing works out, I'm happy

gjoseph92 · 2022-08-29T20:31:53Z

        port = s.http_server.port
        response = await http_client.fetch(f"http://localhost:{port}/metrics")
+        assert response.code == 200
+        assert response.headers["Content-Type"] == "text/plain; version=0.0.4"


Do we need to assert the version? I'd think it's okay if that changes?

Mostly added this for coherence with other tests assuming someone had a reason to add this. I don't mind dropping this everywhere.

Dropped during refactoring.

gjoseph92 · 2022-08-29T20:37:56Z

-            family.name: family for family in text_string_to_metric_families(txt)
+            family.name: family
+            for family in text_string_to_metric_families(txt)
+            if family.name.startswith("dask_scheduler_")


This implies to me that worker metrics are now getting mixed in when you hit the scheduler port in tests? Probably because they're all running on the same async server?

I wonder if we need to use an async-friendly prometheus client, like https://github.com/hynek/prometheus-async or, more likely, https://github.com/claws/aioprometheus (since it seems to support creating multiple Service instances per process)?

This is less about worker metrics (though I am not 100% sure that's not happening), but more about additional Python-related metrics that we collect but don't want to test for in here.

I wonder if we need to use an async-friendly prometheus client, like https://github.com/hynek/prometheus-async or, more likely, https://github.com/claws/aioprometheus (since it seems to support creating multiple Service instances per process)?

I'm open to it, but that's outside of the scope of this PR.

This implies to me that worker metrics are now getting mixed in when you hit the scheduler port in tests? Probably because they're all running on the same async server?

They shouldn't, we have different collectors on different server endpoints. They also use worker/scheduler-specific methods to gather metrics to display.

Async clients look like an interesting thing but the actual value they provide is a bit unclear to me.

I guess this can make the collection a bit smoother (currently it's a plain generator) such that it doesn't block the event loop during collection. Once we have more experience with this we can revisit this. Last time I was using prometheus I didn't encounter any loop instabilities

gjoseph92 · 2022-08-29T20:38:57Z

+    # request data twice since there once was a case where metrics got registered multiple times resulting in
+    # prometheus_client errors
+    await fetch_metrics()


Is this still necessary? We're just making sure it doesn't error the second time?

This is the behavior we currently have on main. I'm open to changing this but I also lack context as to why this has been necessary at some point in time.

gjoseph92 · 2022-08-29T20:40:49Z


    http_client = AsyncHTTPClient()

+    async def fetch_metrics():


Seems like fetch_metrics could be factored out to a utils_test function?

fjetter · 2022-08-30T12:52:55Z

@gjoseph92 I leave merging to you

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

fjetter · 2022-08-31T09:19:58Z

merging once CI is done

hendrikmakait · 2022-08-31T10:35:20Z

CI flakes: #4955, #6967 (hasn't been merged into this branch IIRC)

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

hendrikmakait self-assigned this Aug 24, 2022

hendrikmakait marked this pull request as ready for review August 25, 2022 08:04

hendrikmakait requested review from crusaderky, fjetter, gjoseph92 and graingert August 25, 2022 12:34

gjoseph92 reviewed Aug 29, 2022

View reviewed changes

hendrikmakait requested a review from gjoseph92 August 30, 2022 11:45

fjetter approved these changes Aug 30, 2022

View reviewed changes

hendrikmakait and others added 14 commits August 31, 2022 10:27

Check for all expected metrics

fbcdb57

Check for all expected metrics in scheduler prometheus collector

d2a437f

Fix test name

84d7656

Test prometheus worker task states

0834fc8

Refactor to event-based test

981aad0

Improve singleton pattern

b4a3aa1

Fix test

ac6876d

Update distributed/http/worker/tests/test_worker_http.py

f79e003

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

Update distributed/http/worker/tests/test_worker_http.py

a6dd39a

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

Update distributed/http/worker/tests/test_worker_http.py

c034689

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

Update distributed/http/worker/prometheus/core.py

c20f075

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

Fix typos

35a9700

Extract fetch_metrics

157d25b

Fix lint

51f9781

hendrikmakait force-pushed the improve-worker-metric-collector-testing branch from 801c4cd to 51f9781 Compare August 31, 2022 08:27

fjetter merged commit cefd9db into dask:main Aug 31, 2022

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022

Improve testing of {Scheduler|Worker}MetricCollector (dask#6945)

4c9bb19

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

Uh oh!

Uh oh!

Conversation

hendrikmakait commented Aug 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Aug 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Unit Test Results

Uh oh!

hendrikmakait commented Aug 25, 2022

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fjetter commented Aug 30, 2022

Uh oh!

fjetter commented Aug 31, 2022

Uh oh!

hendrikmakait commented Aug 31, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hendrikmakait commented Aug 24, 2022 •

edited

Loading

github-actions Bot commented Aug 24, 2022 •

edited

Loading